W6 Lab Assignment

A deep dive into histograms and boxplots.


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

sns.set_style('white')

%matplotlib inline

Histogram

Let's revisit the table from class.

Hours Frequency
0-1 4,300
1-3 6,900
3-5 4,900
5-10 2,000
10-24 2,100

You can draw a histogram by just providing bins and counts instead of a list of numbers. So, let's do that for convenience.


In [2]:
bins = [0, 1, 3, 5, 10, 24]
data = {0.5: 4300, 2: 6900, 4: 4900, 7: 2000, 15: 2100}

Draw a histogram using this data. A useful search query: "matplotlib histogram pre-counted".


In [3]:
# TODO: draw a histogram with pre-counted data.
# Pass one representative value per bin and use its count as the weight.
val, weight = zip(*data.items())
plt.hist(val, weights=weight, bins=bins)
plt.xlabel("Hours")


Out[3]:
<matplotlib.text.Text at 0x7fd3f21d2390>

As you can see, the default histogram does not normalize by bin width and simply shows the counts! This can be very misleading if you are working with variable bin widths. One simple way to fix this is the density option (called normed in older matplotlib releases; that name has since been removed).


In [4]:
# TODO: fix it with the density option (formerly normed).
plt.hist(val, weights=weight, bins=bins, density=True)


Out[4]:
(array([ 0.21287129,  0.17079208,  0.12128713,  0.01980198,  0.00742574]),
 array([ 0,  1,  3,  5, 10, 24]),
 <a list of 5 Patch objects>)
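
The normalized heights can be reproduced by hand. The following is a minimal sketch, assuming the usual definition of a density histogram (each count divided by the total count times the bin width); the variable names are chosen for illustration only.

```python
import numpy as np

# Bin edges and pre-counted frequencies from the table above.
bins = np.array([0, 1, 3, 5, 10, 24])
counts = np.array([4300, 6900, 4900, 2000, 2100])

widths = np.diff(bins)                       # width of each bin in hours
density = counts / (counts.sum() * widths)   # height = count / (total * width)
total_area = (density * widths).sum()        # bar areas should add up to 1

print(density)
print(total_area)
```

The resulting heights agree with the first array in Out[4], and the bar areas sum to 1.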

IMDB data

How does matplotlib decide the bin width? Let's try with the IMDb data.


In [5]:
# TODO: Load IMDB data into movie_df using pandas
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
movie_df.head()


Out[5]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
2 #7DaysLater 2013 7.1 14
3 #Bikerlive 2014 6.8 11
4 #ByMySide 2012 5.5 13

Plot the histogram of movie ratings using the plt.hist() function.


In [6]:
plt.hist(movie_df['Rating'])


Out[6]:
(array([   824.,   3363.,   9505.,  21207.,  42500.,  69391.,  86470.,
         58059.,  21538.,    154.]),
 array([ 1.  ,  1.89,  2.78,  3.67,  4.56,  5.45,  6.34,  7.23,  8.12,
         9.01,  9.9 ]),
 <a list of 10 Patch objects>)

Have you noticed that this function returns three objects? Take a look at the documentation of plt.hist() to figure out what they are.

To get the returned three objects:


In [7]:
n_raw, bins_raw, patches = plt.hist(movie_df['Rating'])
print(n_raw)
print(bins_raw)


[   824.   3363.   9505.  21207.  42500.  69391.  86470.  58059.  21538.
    154.]
[ 1.    1.89  2.78  3.67  4.56  5.45  6.34  7.23  8.12  9.01  9.9 ]

n_raw contains the bin counts, i.e., the number of movies in each of the 10 bins. Thus, the sum of the elements in n_raw should be equal to the total number of movies:


In [8]:
# TODO: test whether the sum of the numbers in n_raw is equal to the number of movies. 
sum(n_raw)==len(movie_df)


Out[8]:
True

The second returned object (bins_raw) is an array containing the 11 edges of the 10 bins: the first bin is [1.0, 1.89), the second [1.89, 2.78), and so on; only the last bin also includes its right edge. We can calculate the width of each bin.


In [9]:
# TODO: calculate the width of each bin and print them. 
for i in range(len(bins_raw)-1):
    print (bins_raw[i+1] - bins_raw[i])


0.89
0.89
0.89
0.89
0.89
0.89
0.89
0.89
0.89
0.89

The above for loop can be conveniently rewritten as the following, using list comprehension and the zip() function. Can you explain what's going on inside the zip?


In [10]:
[ j-i for i,j in zip(bins_raw[:-1],bins_raw[1:]) ]


Out[10]:
[0.89000000000000012,
 0.89000000000000012,
 0.88999999999999968,
 0.89000000000000057,
 0.88999999999999968,
 0.88999999999999968,
 0.89000000000000057,
 0.89000000000000057,
 0.88999999999999879,
 0.89000000000000057]

zip() makes an iterator that aggregates elements from each of the iterables. It returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences, and it stops when the shortest input is exhausted. Here, zip(bins_raw[:-1], bins_raw[1:]) pairs each bin's left edge with its right edge, so j-i is the width of that bin.
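
The pairing done by zip() is easier to see on a short list. This is a small sketch using the first few edges printed above; np.diff is shown as an alternative way to get the same widths.

```python
import numpy as np

edges = [1.0, 1.89, 2.78, 3.67]            # first few bin edges from above

# Pair every edge with its successor: (left edge, right edge) for each bin.
pairs = list(zip(edges[:-1], edges[1:]))
widths = [right - left for left, right in pairs]

# np.diff computes the same consecutive differences in one call.
widths_np = np.diff(edges)

print(pairs)
print(widths)
```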

Noticed that the width of each bin is the same? This is equal-width binning. We can calculate the width as:


In [11]:
min_rating = min(movie_df['Rating'])
max_rating = max(movie_df['Rating'])
print(min_rating, max_rating)
print( (max_rating-min_rating) / 10 )


1.0 9.9
0.89
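
Equal-width edges like the ones above can also be generated directly. This is a sketch, assuming the edges are simply spaced evenly between the minimum and maximum rating; it is not a claim about matplotlib's internal code path.

```python
import numpy as np

min_rating, max_rating, num_bins = 1.0, 9.9, 10

# num_bins bins need num_bins + 1 edges.
edges = np.linspace(min_rating, max_rating, num_bins + 1)
width = (max_rating - min_rating) / num_bins

print(edges)
print(width)
```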

Now, let's plot the histogram where the y axis is normalized.


In [12]:
n, bins, patches = plt.hist(movie_df['Rating'], density=True)
print(n)
print(bins)


[ 0.00295786  0.01207195  0.03411949  0.07612541  0.15255952  0.24908842
  0.31039581  0.20841067  0.07731358  0.0005528 ]
[ 1.    1.89  2.78  3.67  4.56  5.45  6.34  7.23  8.12  9.01  9.9 ]

In this case, the edges of the 10 bins do not change. But now n represents the heights of the bins. Can you verify that matplotlib has correctly normalized the heights of the bins?

Hint: the area of each bin should be equal to the fraction of movies in that bin.


In [13]:
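# TODO: verify that it is properly normalized.
# Compute the fraction of movies that fall into each bin.
fractions = []
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = movie_df[(movie_df['Rating'] >= lo) & (movie_df['Rating'] <= hi)]
    fractions.append(round(len(in_bin) / len(movie_df), 4))
print("Fraction of movies per bin", fractions)
print("Data from histogram", n)


Fraction of movies per bin [0.0026, 0.0107, 0.0304, 0.0678, 0.1358, 0.2217, 0.2763, 0.1855, 0.0688, 0.0005]
Data from histogram [ 0.00295786  0.01207195  0.03411949  0.07612541  0.15255952  0.24908842
  0.31039581  0.20841067  0.07731358  0.0005528 ]

The heights in n are densities, so each one has to be multiplied by the bin width (0.89) to get the area of its bar. For example, 0.31039581 * 0.89 is about 0.2763, which matches the fraction of movies in the seventh bin, so the histogram is properly normalized.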
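
The same check can be done for all bins at once. This is a minimal sketch on synthetic ratings (the random sample is a stand-in, not the IMDb data), using np.histogram, which returns counts or densities together with the bin edges.

```python
import numpy as np

# Synthetic stand-in for the rating column: one-decimal values in [1.0, 9.9].
rng = np.random.default_rng(42)
ratings = np.round(rng.uniform(1.0, 9.9, size=5000), 1)

counts, edges = np.histogram(ratings, bins=10)
heights, _ = np.histogram(ratings, bins=10, density=True)

areas = heights * np.diff(edges)      # area of each bar
fractions = counts / counts.sum()     # share of data points in each bin

print(np.allclose(areas, fractions))
print(areas.sum())
```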

Selecting binsize

A nice way to explore this is using "small multiples" with a set of sample bin sizes. In other words, pick some bin sizes that you want to see and draw many plots within a single "figure". Read about subplot. For instance, you can do something like:


In [14]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
movie_df['Rating'].hist(bins=3)
plt.subplot(1,2,2)
movie_df['Rating'].hist(bins=100)


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd3ee5402e8>

What does the argument in plt.subplot(1,2,1) mean?
http://stackoverflow.com/questions/3584805/in-matplotlib-what-does-the-argument-mean-in-fig-add-subplot111
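
In short, the three arguments are the number of rows, the number of columns, and a 1-based index that counts panels from the upper left, row by row. The helper below is hypothetical (it is not part of matplotlib) and only illustrates how that index maps to a grid position.

```python
def subplot_position(nrows, ncols, index):
    """Return the zero-based (row, column) for a 1-based subplot index.

    Hypothetical helper for illustration; matplotlib does this internally.
    """
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    return divmod(index - 1, ncols)


# Positions of the 8 panels in a grid with 2 rows and 4 columns.
positions = [subplot_position(2, 4, i) for i in range(1, 9)]
print(positions)
```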

Ok, so create 8 subplots (2 rows and 4 columns) with the given binsizes.


In [15]:
binsizes = [2, 3, 5, 10, 30, 40, 60, 100 ]

plt.figure(1, figsize=(18,8))

for i, bins in enumerate(binsizes):     
    # TODO: use subplot and hist() function to draw 8 plots
    plt.subplot(2, 4, i + 1)
    movie_df['Rating'].hist(bins = bins)
    plt.title("Bin size " + str(bins))


Do you notice the weird patterns that emerge from bins=40? Can you guess why you see such patterns? What are the peaks and what are the empty bars? What do they tell you about choosing the bin size in histograms?

# TODO: Provide your answer and evidence here
The weird patterns appear as the number of bins increases and the bin width shrinks. Ratings are recorded with one decimal place, so when the bin width is not a multiple of 0.1, neighboring bins cover different numbers of distinct rating values. The peaks are bins that cover more rating values than their neighbors, and the empty bars are bins that cover no rating value at all. This tells us that choosing the bin size carefully is important to avoid misinterpreting the data. To choose a reasonable bin size we can use one of the following rules:
1. Square-root choice: k = square root(n) bins
2. Sturges' formula: k = ceil(log2(n) + 1) bins
3. Freedman-Diaconis' choice: bin width h = 2*(IQR(x)/cube_root(n))
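
The effect can be reproduced without the real data. This sketch assumes that every one-decimal value from 1.0 to 9.9 occurs exactly once and counts how many distinct values land in each bin.

```python
import numpy as np

# All 90 possible one-decimal ratings from 1.0 to 9.9, each taken once.
possible = np.arange(10, 100) / 10

counts40, _ = np.histogram(possible, bins=40)
counts120, _ = np.histogram(possible, bins=120)

# With 40 bins, some bins cover 2 values and others 3, which creates peaks.
print(sorted(set(counts40.tolist())))

# With 120 bins there are more bins than values, so some bins stay empty.
empty120 = int((counts120 == 0).sum())
print(empty120)
```

Even for perfectly uniform ratings, the 40-bin histogram is uneven and the 120-bin histogram has gaps, so these patterns come from the binning rather than from the data.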

Now, let's try to apply several algorithms for finding the number of bins.


In [16]:
N = len(movie_df['Rating'])

# TODO: plot three histograms based on three formulae

plt.figure(figsize=(12,4))


# Sqrt 
nbins = int(np.sqrt(N))

plt.subplot(1,3,1)
plt.hist(movie_df['Rating'], bins = nbins)
plt.title("SQRT, {0} bins".format(nbins))

# Sturges' formula
plt.subplot(1,3,2)
nbins = int(np.ceil(np.log2(N) + 1))
plt.hist(movie_df['Rating'], bins = nbins)
plt.title("Sturges', {0} bins".format(nbins))

# Freedman-Diaconis
plt.subplot(1,3,3)
data = movie_df['Rating'].sort_values()  # Series.order() was removed from pandas
iqr = np.percentile(data, 75) - np.percentile(data, 25)
width = 2*iqr/np.power(N, 1/3)
nbins = int((max(data) - min(data)) / width)
plt.hist(movie_df['Rating'], bins = nbins)
plt.title("Freedman-Diaconis, {0} bins".format(nbins))


Out[16]:
<matplotlib.text.Text at 0x7fd3ec569be0>
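
The three rules can be collected into one function for reuse. This is a sketch on a synthetic sample; the function name bin_counts is made up for illustration, and the rounding follows the cell above (truncation for the square-root and Freedman-Diaconis rules, rounding up for Sturges' formula).

```python
import numpy as np


def bin_counts(values):
    """Return the number of bins suggested by three common rules."""
    x = np.asarray(values, dtype=float)
    n = x.size

    sqrt_k = int(np.sqrt(n))                     # square-root choice
    sturges_k = int(np.ceil(np.log2(n) + 1))     # Sturges' formula

    # Freedman-Diaconis gives a bin width, which is converted to a bin count.
    q25, q75 = np.percentile(x, [25, 75])
    width = 2 * (q75 - q25) / n ** (1 / 3)
    fd_k = int((x.max() - x.min()) / width)

    return {"sqrt": sqrt_k, "sturges": sturges_k, "fd": fd_k}


# Synthetic sample, not the IMDb ratings.
rng = np.random.default_rng(0)
sample = rng.normal(6.3, 1.4, size=1000)
suggested = bin_counts(sample)
print(suggested)
```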

Investigating the anomalies in the histogram

Let's investigate the anomalies in the histogram.


In [17]:
# TODO: draw the histogram with 120 bins
n, bins, patches = plt.hist(movie_df['Rating'], bins = 120)
plt.title("Histogram with bins 120")
plt.xlabel("Rating")
plt.ylabel("Frequency")


Out[17]:
<matplotlib.text.Text at 0x7fd3eb3d29e8>

We can locate the empty bins by checking whether the corresponding value in n is zero.


In [18]:
# TODO: print out bins that don't contain any values. Check whether they fall into a range like [1.8XX, 1.8XX]
# useful zip: zip(bins[:-1], bins[1:], n)  what does this do?
# Each tuple holds (left edge, right edge, count) for one bin.
print("Ranges with zero counts are as follows")
for left, right, count in zip(bins[:-1], bins[1:], n):
    if count == 0:
        print([left, right])
        # If both edges start with the same digits (e.g. 1.2...), the bin
        # sits between two one-decimal rating values.
        if str(left)[:3] == str(right)[:3]:
            print("They fall in range")


Ranges with zero counts are as follows
[1.2225000000000001, 1.2966666666666666]
They fall in range
[1.5191666666666666, 1.5933333333333333]
They fall in range
[1.8158333333333334, 1.8900000000000001]
They fall in range
[2.1124999999999998, 2.1866666666666665]
They fall in range
[2.4091666666666667, 2.4833333333333334]
They fall in range
[2.7058333333333335, 2.7800000000000002]
They fall in range
[3.0024999999999999, 3.0766666666666667]
They fall in range
[3.2250000000000001, 3.2991666666666668]
They fall in range
[3.5216666666666669, 3.5958333333333337]
They fall in range
[3.8183333333333334, 3.8925000000000001]
They fall in range
[4.1150000000000002, 4.1891666666666669]
They fall in range
[4.4116666666666671, 4.4858333333333338]
They fall in range
[4.7083333333333339, 4.7825000000000006]
They fall in range
[5.0049999999999999, 5.0791666666666666]
They fall in range
[5.3016666666666667, 5.3758333333333335]
They fall in range
[5.5241666666666669, 5.5983333333333336]
They fall in range
[5.8208333333333337, 5.8950000000000005]
They fall in range
[6.1175000000000006, 6.1916666666666673]
They fall in range
[6.4141666666666675, 6.4883333333333342]
They fall in range
[6.7108333333333334, 6.7850000000000001]
They fall in range
[7.0075000000000003, 7.081666666666667]
They fall in range
[7.3041666666666671, 7.3783333333333339]
They fall in range
[7.600833333333334, 7.6750000000000007]
They fall in range
[7.8233333333333341, 7.8975000000000009]
They fall in range
[8.120000000000001, 8.1941666666666677]
They fall in range
[8.4166666666666679, 8.4908333333333346]
They fall in range
[8.7133333333333347, 8.7875000000000014]
They fall in range
[9.0099999999999998, 9.0841666666666665]
They fall in range
[9.3066666666666666, 9.3808333333333334]
They fall in range
[9.6033333333333335, 9.6775000000000002]
They fall in range

In [19]:
# TODO: draw the histogram with 120 bins
n, bins, patches = plt.hist(movie_df['Rating'], bins = 120)
plt.title("Histogram with bins 120")
plt.xlabel("Rating")
plt.ylabel("Frequency")


Out[19]:
<matplotlib.text.Text at 0x7fd3ead52588>

One way to identify a peak is to compare the count of a bin with its neighboring bins and see whether it is higher.


In [20]:
# TODO: identify peaks and print the bins with the peaks 
# e.g. 
# [1.0, 1.1]
# [1.3, 1.4]
# [1.6, 1.7]
# ...
#
# you can use zip again like zip(bins[:-1], bins[1:]  ... ) to access the data in two adjacent bins.
values = list(zip(bins[:-1], bins[1:], n))
print("Bins with peaks are as follows")
# A bin is a peak if its count is higher than both of its neighbors.
for i in range(1, len(values) - 1):
    if values[i][2] > values[i-1][2] and values[i][2] > values[i+1][2]:
        print([values[i][0], values[i][1]])


Bins with peaks are as follows
[1.1483333333333334, 1.2225000000000001]
[1.3708333333333333, 1.4450000000000001]
[1.5933333333333333, 1.6675]
[1.7416666666666667, 1.8158333333333334]
[1.9641666666666668, 2.0383333333333331]
[2.1866666666666665, 2.2608333333333333]
[2.335, 2.4091666666666667]
[2.5575000000000001, 2.6316666666666668]
[2.7800000000000002, 2.854166666666667]
[2.9283333333333337, 3.0024999999999999]
[3.1508333333333334, 3.2250000000000001]
[3.3733333333333335, 3.4475000000000002]
[3.5958333333333337, 3.6700000000000004]
[3.7441666666666671, 3.8183333333333334]
[3.9666666666666668, 4.0408333333333335]
[4.1891666666666669, 4.2633333333333336]
[4.3375000000000004, 4.4116666666666671]
[4.5600000000000005, 4.6341666666666672]
[4.7825000000000006, 4.8566666666666674]
[4.9308333333333341, 5.0049999999999999]
[5.1533333333333333, 5.2275]
[5.3758333333333335, 5.4500000000000002]
[5.5983333333333336, 5.6725000000000003]
[5.746666666666667, 5.8208333333333337]
[5.9691666666666672, 6.0433333333333339]
[6.1916666666666673, 6.265833333333334]
[6.3400000000000007, 6.4141666666666675]
[6.5625, 6.6366666666666667]
[6.7850000000000001, 6.8591666666666669]
[6.9333333333333336, 7.0075000000000003]
[7.1558333333333337, 7.2300000000000004]
[7.3783333333333339, 7.4525000000000006]
[7.5266666666666673, 7.600833333333334]
[7.7491666666666674, 7.8233333333333341]
[7.9716666666666676, 8.0458333333333343]
[8.1941666666666677, 8.2683333333333344]
[8.4908333333333346, 8.5650000000000013]
[8.7875000000000014, 8.8616666666666681]
[9.0841666666666665, 9.1583333333333332]
[9.3808333333333334, 9.4550000000000001]
[9.7516666666666669, 9.8258333333333336]
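
The neighbor comparison can also be written without an explicit loop. This is a sketch with made-up counts and edges, using the same rule of thumb (a bin is a peak if it is higher than both neighbors).

```python
import numpy as np

# Made-up counts and edges for six bins, for illustration only.
counts = np.array([1, 5, 2, 0, 3, 1])
edges = np.linspace(0.0, 6.0, 7)

# Compare every interior bin with its left and right neighbor at once.
inner = counts[1:-1]
is_peak = (inner > counts[:-2]) & (inner > counts[2:])
peak_idx = np.flatnonzero(is_peak) + 1     # shift back to original indices

peak_bins = [(float(edges[i]), float(edges[i + 1])) for i in peak_idx]
print(peak_bins)
```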

Ok. They don't necessarily cover the integer values. Let's see the minimum number of votes.


In [21]:
movie_df.describe()


Out[21]:
Year Rating Votes
count 313011.000000 313011.000000 313011.000000
mean 1988.418334 6.296195 1691.231775
std 26.636414 1.363866 18593.708570
min 1874.000000 1.000000 5.000000
25% 1974.000000 5.500000 10.000000
50% 1999.000000 6.500000 25.000000
75% 2009.000000 7.300000 110.500000
max 2017.000000 9.900000 1511933.000000

Ok, the minimum number of votes is 5, not 1. IMDb may only keep the rating information for movies with at least 5 votes. This may explain why the most frequent ratings are values like 6.4 and 6.6. Let's plot the histogram with only the rows with 5 votes. Set the number of bins to 30.


In [22]:
# TODO: plot the histogram only with ratings that have the minimum number of votes. 
df = movie_df[movie_df['Votes'] == 5]
plt.hist(df['Rating'], bins = 30)
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.title("Histogram of rating with min no of votes")


Out[22]:
<matplotlib.text.Text at 0x7fd3eaadc358>

Then, print out the most frequent rating values. Use the value_counts() function on the Rating column.


In [23]:
# TODO: filter out the rows with the min number of votes (5) and then `value_counts()` them. 
# sort the result to see what are the most common numbers. 
df['Rating'].value_counts()
# As you can see in the following output that 6.4 is most common rating.


Out[23]:
6.4    1017
6.6     932
6.2     923
5.8     922
6.0     883
6.8     877
5.6     861
5.4     835
7.0     818
5.2     815
4.8     760
7.2     735
7.4     734
7.6     689
5.0     673
8.2     655
4.6     640
7.8     621
4.4     611
4.2     565
8.0     550
4.0     465
3.8     412
8.4     366
8.6     307
3.6     295
3.4     289
3.2     250
8.8     243
2.8     226
3.0     216
2.6     133
2.4      81
2.2      60
2.0      40
1.8      28
1.4      23
1.6      16
1.0      15
1.2       6
9.2       5
9.0       5
9.4       5
9.8       2
9.6       2
dtype: int64

So, the most frequent values are not "x.0". Let's see the CDF.


In [24]:
# Plot the CDF of votes.

What's going on? The number of votes is heavily skewed and most datapoints are at the left end.
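
An empirical CDF can be computed directly from the sorted values. This is a minimal sketch with a handful of made-up vote counts (not the IMDb column), showing why a skewed variable crowds the left end of the plot.

```python
import numpy as np

# Made-up, heavily skewed vote counts for illustration.
votes = np.array([5, 5, 7, 10, 25, 110, 1500])

x = np.sort(votes)
y = np.arange(1, len(x) + 1) / len(x)   # fraction of data points <= x

# Share of data points with at most 100 votes.
share_le_100 = np.searchsorted(x, 100, side="right") / len(x)

print(list(zip(x.tolist(), y.tolist())))
print(share_le_100)
```

Passing x and y to plt.step(x, y, where='post') would draw the curve, and plt.xlim(0, 100) would zoom in on the crowded left end.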


In [25]:
# TODO: plot the same thing but limit the xrange (xlim) to [0, 100].

Draw a histogram focused on the range [0, 10] to just see how many datapoints are there.


In [26]:
# TODO: set the xlim to [0, 10] adjust ylim and bins so that 
# we can see how many datapoints are there for each # of votes.

Let's assume that most of the 5 individual votes are between 5 and 8 and see what we get. You can use the itertools.product function to generate the fake ratings.


In [27]:
#list(product([5,6,7,8], repeat=5))[:10]

In [28]:
from itertools import product
from collections import Counter

c = Counter()
for x in product([5,6,7,8], repeat=5):
    c[str(round(np.mean(x), 1))]+=1
sorted(c.items(), key=lambda x: x[1], reverse=True)
    
# or sorted(Counter(str(round(np.mean(x), 1)) for x in product([5,6,7,8], repeat=5)).items(), key=lambda x: x[1], reverse=True)


Out[28]:
[('6.6', 155),
 ('6.4', 155),
 ('6.8', 135),
 ('6.2', 135),
 ('7.0', 101),
 ('6.0', 101),
 ('7.2', 65),
 ('5.8', 65),
 ('7.4', 35),
 ('5.6', 35),
 ('5.4', 15),
 ('7.6', 15),
 ('7.8', 5),
 ('5.2', 5),
 ('5.0', 1),
 ('8.0', 1)]
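
Only even decimals show up, just as in the value counts above. This sketch checks that for all combinations of 5 votes, assuming (the source does not state IMDb's actual formula) that the rating is the plain mean of 5 integer votes between 1 and 10.

```python
from itertools import product

# Every possible mean of 5 integer votes between 1 and 10, rounded to 0.1.
means = {round(sum(votes) / 5, 1) for votes in product(range(1, 11), repeat=5)}

# Express each mean in tenths and check that the result is always even,
# i.e. every mean is a multiple of 0.2.
tenths = sorted(int(round(m * 10)) for m in means)
all_even = all(t % 2 == 0 for t in tenths)

print(len(means), all_even)
```

Under that assumption a movie with 5 votes cannot have a rating such as 6.5, which is consistent with the missing odd decimals.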

Boxplot

Let's look at the example data that we looked at during the class.


In [29]:
data = [-1, 3, 3, 4, 15, 16, 16, 17, 23, 24, 24, 25, 35, 36, 37, 46]

The numpy.percentile() function provides a way to calculate the percentiles. Note that with the interpolation option (renamed to method in newer NumPy releases), you can specify which value to take when the percentile lies between two data points. The default is linear.


In [30]:
print(np.percentile(data, 25))
print(np.percentile(data, 50), np.median(data))
print(np.percentile(data, 75))


12.25
20.0 20.0
27.5

Can you explain why you get those first and third quartile values? The first quartile value is not 4, not 15, and not 9.5. Why?

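# TODO: explain
np.percentile() interpolates linearly by default. With 16 sorted values, the 25th percentile sits at position 0.25 * (16 - 1) = 3.75 (counting from 0), i.e., three quarters of the way from the 4th value (4) to the 5th value (15): 4 + 0.75 * 11 = 12.25. Likewise, the 75th percentile sits at position 11.25, between 25 and 35: 25 + 0.25 * 10 = 27.5.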
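
The linear interpolation can be written out by hand. The function linear_percentile below is a hypothetical helper for illustration; it mirrors the default linear rule and is compared against np.percentile.

```python
import math

import numpy as np

data = [-1, 3, 3, 4, 15, 16, 16, 17, 23, 24, 24, 25, 35, 36, 37, 46]


def linear_percentile(values, q):
    """Percentile q (0 to 100) with linear interpolation between neighbors."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100    # fractional index into sorted data
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower                    # how far to move toward the upper value
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


q1 = linear_percentile(data, 25)
q3 = linear_percentile(data, 75)
print(q1, q3)
print(np.percentile(data, 25), np.percentile(data, 75))
```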

Let's draw a boxplot with matplotlib.


In [31]:
# TODO: draw a boxplot of the data
plt.boxplot(data)


Out[31]:
{'boxes': [<matplotlib.lines.Line2D at 0x7fd3e9163128>],
 'caps': [<matplotlib.lines.Line2D at 0x7fd3e9169ac8>,
  <matplotlib.lines.Line2D at 0x7fd3e9169d30>],
 'fliers': [<matplotlib.lines.Line2D at 0x7fd3e916dd30>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0x7fd3e916d550>],
 'whiskers': [<matplotlib.lines.Line2D at 0x7fd3e9163ac8>,
  <matplotlib.lines.Line2D at 0x7fd3e9163d30>]}
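
The elements of the boxplot can be derived from the quartiles. This is a sketch, assuming matplotlib's default whisker setting (whis=1.5), where whiskers extend to the most extreme data points within 1.5 times the IQR of the box.

```python
import numpy as np

data = np.array([-1, 3, 3, 4, 15, 16, 16, 17, 23, 24, 24, 25, 35, 36, 37, 46])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond these fences would be drawn as individual outliers (fliers).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

For this data no point lies outside the fences, so the whiskers end at the minimum and maximum and no fliers are drawn.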
